Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks

With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes that accounts for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and the analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants of white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and present the genome-wide scale, phenome-wide association results via the PheWeb browser.

Let $I(\cdot)$ denote the indicator function. For each subject $i = 1, \ldots, N$, we observe the event status and time $(\delta_i, t_i)$. We call $t_i$ a failure time for subject $i$ if $\delta_i = 1$, and a censoring time if $\delta_i = 0$. Without loss of generality, we assume the $t_i$s are in increasing order, i.e., $t_1 \le t_2 \le \cdots \le t_N$. Let $X_i$ denote the $q \times 1$ vector of covariates (excluding the intercept) to adjust for, and $G_i$ denote the genotype (0, 1, or 2) of the variant we are testing. Throughout this paper, a quantity without a subscript denotes the vector of the corresponding subscripted quantities (e.g., $G = (G_1, \ldots, G_N)^T$), when it is clear from the context. Denote by $X$ the $N \times q$ covariate matrix with $i$-th row $X_i^T$. Then, in a frailty model, the conditional hazard function for subject $i$ at time $t$, given the random effects $b_i$s, is
$$\lambda_i(t \mid b_i) = \lambda_0(t)\, \exp(X_i^T \beta + G_i \gamma + b_i) = \lambda_0(t)\, e^{\eta_i},$$
where $\lambda_0(t)$ is the baseline hazard function and $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ is the cumulative baseline hazard (CBH) function. Under the multivariate Gaussian frailty, the random effects $b \sim N(0, \tau V)$, where $\tau$ is the variance component, and $V$ is the $N \times N$ genetic relationship matrix (GRM). We further assume that, conditional on $b$, the censoring is independent and non-informative of $b$. Then the likelihood of the observed data is
$$L = \int \prod_{i=1}^{N} \left[\lambda_0(t_i)\, e^{\eta_i}\right]^{\delta_i} \exp\left\{-\Lambda_0(t_i)\, e^{\eta_i}\right\} \phi(b; 0, \tau V)\, db,$$
where $\phi(\cdot\,; 0, \tau V)$ is the $N(0, \tau V)$ density. We can approximate the integral using the Laplace approximation, writing the integrand as $e^{f(b)}$, where $\hat{b}$ is the solution to $f'(b) = 0$. The resulting approximate log-likelihood (1), as a function of $\beta$ and $\tau$, is similar to the Poisson GLMM log-likelihood. 1, 2 Following Breslow and Clayton, 2 assuming the weight matrix $W$ changes slowly as a function of the mean, we can maximize the following penalized log-likelihood, 3 given $\tau$, to obtain the maximum likelihood estimators (MLEs):
$$\ell_p = \sum_{i=1}^{N} \left\{\delta_i\left(\log \lambda_0(t_i) + \eta_i\right) - \Lambda_0(t_i)\, e^{\eta_i}\right\} - \frac{1}{2\tau}\, b^T V^{-1} b. \qquad (2)$$
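For concreteness, the Laplace step can be written in its standard form. The following is a sketch under the stated Gaussian frailty assumptions; the exact bookkeeping of constants in the paper's equation (1) may differ:

```latex
% Laplace approximation of the integrated log-likelihood (standard form).
% Here exp{f(b)} denotes the integrand above, i.e.,
%   f(b) = sum_i [ delta_i (log lambda_0(t_i) + eta_i) - Lambda_0(t_i) e^{eta_i} ]
%          - (1/(2 tau)) b' V^{-1} b + const.
\log \int e^{f(b)}\,db
  \;\approx\; f(\hat b)
  \;-\; \tfrac{1}{2}\log\det\!\left\{-f''(\hat b)\right\}
  \;+\; \tfrac{N}{2}\log(2\pi),
\qquad \text{where } f'(\hat b) = 0 .
```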

Estimation of the baseline hazard, fixed effects and random effects given the variance component
Let the unique failure times be denoted $t^*_1 < t^*_2 < \cdots < t^*_K$, in increasing order. Following Breslow's suggestion, 4 we assume the baseline hazard function remains constant between successive failure times, and, following Kalbfleisch and Prentice's 5 convention, we treat censored observations as censored at the previous failure time. In section 6 we show that maximum likelihood estimation under these assumptions is equivalent to maximizing a partial likelihood. 6 Formally, let $e^{\alpha_i}$ denote the increment of the CBH at $t^*_i$. Then the cumulative hazard function can be written as a step function,
$$\Lambda_0(t) = \sum_{i \,:\, t^*_i \le t} e^{\alpha_i},$$
with the convention $t^*_0 = t_0 = 0$. Then, rearranging the order of the summations, we can write
$$\ell_p = \sum_{i=1}^{K} \left\{ d_i\, \alpha_i + \sum_{j \,:\, t_j = t^*_i} \delta_j\, \eta_j - e^{\alpha_i} \sum_{j \in R(t^*_i)} e^{\eta_j} \right\} - \frac{1}{2\tau}\, b^T V^{-1} b,$$
where $R(t_i) = \{j : t_j \ge t_i\}$ denotes the set of subjects at risk at time $t_i$. This definition of the at-risk set corresponds to Breslow's approximation when there are tied failure times. 7 We can express $\ell_p$ as a function of $(\alpha, \beta, \gamma, b)$, and obtain the score functions by taking derivatives of $\ell_p$ with respect to $\beta$, $\gamma$, $b$, and the $\alpha_i$s; for example,
$$\frac{\partial \ell_p}{\partial \alpha_i} = d_i - e^{\alpha_i} \sum_{j \in R(t^*_i)} e^{\eta_j}, \qquad \frac{\partial \ell_p}{\partial b} = \sum_{j=1}^{N} \left\{\delta_j - \Lambda_0(t_j)\, e^{\eta_j}\right\} Z_j - \frac{1}{\tau}\, V^{-1} b,$$
where $Z_j$ is an $N \times 1$ vector with $j$-th element 1 and the rest 0s, and $d_i = \sum_{j \,:\, t_j = t^*_i} \delta_j$ is the number of failures at time $t^*_i$. Setting $\partial \ell_p / \partial \alpha_i = 0$, we obtain the MLEs
$$e^{\hat\alpha_i} = \frac{d_i}{\sum_{j \in R(t^*_i)} e^{\eta_j}}.$$
This leads to the famous Breslow's estimator for the CBH,
$$\hat\Lambda_0(t) = \sum_{i \,:\, t^*_i \le t} \frac{d_i}{\sum_{j \in R(t^*_i)} e^{\eta_j}}.$$
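As an illustration, Breslow's CBH estimator above can be computed directly from the fitted linear predictors. The following is a minimal sketch; the function and variable names are ours, not GATE's:

```python
import numpy as np

def breslow_cbh(t, delta, eta):
    """Breslow's estimator of the cumulative baseline hazard (CBH).

    t: observed times; delta: event indicators (1 = failure, 0 = censored);
    eta: linear predictors X @ beta + G * gamma + b from the fitted model.
    Returns the unique failure times t*_1 < ... < t*_K and the CBH at them.
    """
    t, delta = np.asarray(t), np.asarray(delta)
    risk = np.exp(np.asarray(eta))
    fail_times = np.unique(t[delta == 1])           # unique failure times
    jumps = np.empty(len(fail_times))
    for k, tk in enumerate(fail_times):
        d_k = np.sum(delta[t == tk])                # failures at t*_k
        at_risk = risk[t >= tk].sum()               # sum over R(t*_k) of exp(eta_j)
        jumps[k] = d_k / at_risk
    return fail_times, np.cumsum(jumps)             # step-function CBH
```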

Estimation of the variance component
Given $\hat\alpha = \hat\alpha(\tau)$ and $\hat\beta = \hat\beta(\tau)$, the log-likelihood of the variance component can be derived from (1). We maximize the corresponding restricted maximum likelihood (REML). The score function with respect to $\tau$ is given by
$$\frac{\partial \ell_R}{\partial \tau} = \frac{1}{2}\left\{\tilde{Y}^T P V P \tilde{Y} - \mathrm{tr}(PV)\right\},$$
where $\tilde{Y}$ is the working outcome vector and $P$ is the corresponding projection matrix. The observed information function and the expected information function are given by
$$J_\tau = \tilde{Y}^T P V P V P \tilde{Y} - \frac{1}{2}\,\mathrm{tr}(PVPV) \qquad \text{and} \qquad E(J_\tau) = \frac{1}{2}\,\mathrm{tr}(PVPV),$$
respectively. Evaluating both the observed and the expected information functions involves computationally expensive trace computations. To avoid the trace computations, the average information is used in the AI-REML 1, 8, 9 algorithm. The average information is expressed as the average of $J_\tau$ and $E(J_\tau)$,
$$\mathcal{AI}_\tau = \frac{1}{2}\left\{J_\tau + E(J_\tau)\right\} = \frac{1}{2}\, \tilde{Y}^T P V P V P \tilde{Y},$$
in which the trace terms cancel, leaving only matrix-vector products.
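A minimal sketch of one score and average-information evaluation, assuming the standard AI-REML forms above; `P_matvec` and `V_matvec` are hypothetical closures that apply $P$ and $V$ without forming them (in the spirit of section 2), and $\mathrm{tr}(PV)$ is estimated with Hutchinson's randomized trace estimator mentioned there:

```python
import numpy as np

def reml_score_and_ai(P_matvec, V_matvec, y_work, n_probes=30, rng=None):
    """REML score and average information for tau (a sketch).

    score = 0.5 * ( y' P V P y - tr(P V) ),  AI = 0.5 * y' P V P V P y.
    tr(PV) is approximated by E[z' P V z] over random +/-1 probe vectors z.
    """
    rng = np.random.default_rng() if rng is None else rng
    Py = P_matvec(y_work)
    VPy = V_matvec(Py)
    PVPy = P_matvec(VPy)
    quad = y_work @ PVPy                                  # y' P V P y
    probes = rng.choice([-1.0, 1.0], size=(n_probes, y_work.shape[0]))
    trace_est = np.mean([z @ P_matvec(V_matvec(z)) for z in probes])
    score = 0.5 * (quad - trace_est)
    ai = 0.5 * (VPy @ PVPy)                               # y' P V P V P y / 2
    return score, ai
```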

Algorithm to fit the null mixed model
The null model fitting algorithm can be summarized as follows (a schematic implementation is sketched below):
1. Fit a Poisson model with $\tau = 0$ to get the initial estimates $\hat\beta^{(0)}$ and the working outcome vector $\tilde{Y}^{(0)}$.
2. Given $\hat\tau^{(i)}$, update $(\hat\alpha^{(i)}, \hat\beta^{(i)}, \hat{b}^{(i)})$ by maximizing the penalized log-likelihood (2), and update the working outcome vector $\tilde{Y}^{(i)}$.
3. At the $i$-th step, update $\hat\tau$ using the AI-REML update $\hat\tau^{(i+1)} = \hat\tau^{(i)} + \mathcal{AI}_\tau^{-1}\, \partial\ell_R/\partial\tau$, evaluated at the current estimates.
4. Repeat steps 2 and 3 until the estimates converge.
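A schematic of this loop in Python; a sketch only, where `update_working_quantities` is a hypothetical helper standing in for step 2 and `reml_score_and_ai` is the evaluation sketched in the previous section, not GATE's actual API:

```python
import numpy as np

def fit_null_model(update_working_quantities, reml_score_and_ai,
                   tau_init=0.1, tol=1e-4, max_iter=50):
    """Skeleton of the null-model fitting algorithm (illustrative names).

    update_working_quantities(tau) refits (alpha, beta, b) given tau and
    returns the working outcome vector plus operators applying P and V.
    """
    tau = tau_init
    for _ in range(max_iter):
        y_work, P_matvec, V_matvec = update_working_quantities(tau)
        score, ai = reml_score_and_ai(P_matvec, V_matvec, y_work)
        tau_new = max(tau + score / ai, 0.0)   # AI-REML Newton-type update
        if abs(tau_new - tau) < tol * (abs(tau) + tol):
            return tau_new
        tau = tau_new
    return tau
```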

Score test
The score test statistic for $H_0: \gamma = 0$ is given by
$$T = \tilde{G}^T (\delta - \hat\mu),$$
where $\tilde{G} = G - \tilde{X}(\tilde{X}^T \hat{W} \tilde{X})^{-1} \tilde{X}^T \hat{W} G$ is the covariate-and-intercept adjusted genotype vector, $\tilde{X} = (1 \;\; X)$ is the intercept-augmented covariate matrix, and $\hat\mu$ is the estimated mean vector under the null model. The information matrix $\hat{Q}$ corresponding to the score equations in (3) is expressed in terms of the vectors $v_i$ defined below, where $A_{ij}$ and $A_{\cdot i}$ denote the $(i, j)$-th element and the $i$-th column of a matrix $A$, respectively. Then, the variance of the score statistic under $H_0$ is given by
$$\mathrm{Var}_{H_0}(T) = \tilde{G}^T \hat{Q} \tilde{G},$$
where, for $i = 1, \ldots, K$ and $j = 1, \ldots, N$, $v_i$ is an $N$-dimensional vector with elements $v_{ij} = 0$ if $j \notin R(t^*_i)$, and $v_{ij} = \exp(\hat\eta_j)$ if $j \in R(t^*_i)$.
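A minimal sketch of computing $T$ and the adjusted genotype $\tilde{G}$; names are illustrative:

```python
import numpy as np

def score_statistic(G, X, w, delta, mu):
    """Score test statistic T = G~' (delta - mu), a sketch.

    G: genotype vector; X: covariate matrix (without intercept); w: diagonal
    of the weight matrix W-hat; delta: event indicators; mu: fitted means
    under the null model.
    """
    Xt = np.column_stack([np.ones(len(G)), X])       # intercept-augmented X
    WXt = w[:, None] * Xt
    coef = np.linalg.solve(Xt.T @ WXt, WXt.T @ G)    # (X'WX)^{-1} X'WG
    G_adj = G - Xt @ coef                            # adjusted genotype G~
    return G_adj @ (delta - mu), G_adj
```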

2 Approaches to reduce computation and memory cost
To obtain $\mathrm{Var}_{H_0}(T) = \tilde{G}^T \hat{Q} \tilde{G}$, we need to compute quantities of the form $\hat{S}^{-1} a$, where $a$ is a vector. The standard technique of inverting $\hat{S}$ (computation cost $O(N^3)$) and multiplying $\hat{S}^{-1}$ with $a$ can be extremely time consuming when $N$ is large. On top of that, it requires storing $\hat{S}^{-1}$ (memory cost $O(N^2)$), which can impose an extremely high memory requirement. To reduce the computation and memory costs, we implemented several strategies similar to BOLT-LMM 10 and SAIGE. 9 Firstly, instead of storing the $N \times N$ GRM, which costs $4N(N+1)$ bytes if stored using double-precision floating point numbers, we only store the raw genotypes as binary vectors, which costs only $NM/4$ bytes, where $M$ is the number of markers used to calculate the GRM. We calculate the elements of the GRM from the raw genotype vectors only when they are needed. Secondly, to compute quantities of the form $\hat{S}^{-1} a$, we implemented the pre-conditioned conjugate gradient 11 (PCG) method, which computes $\hat{S}^{-1} a = b$ by solving the linear system of equations $\hat{S} b = a$. Thirdly, since $U = \Gamma R^T D R \Gamma$ has a low-rank decomposition with $\mathrm{rank}(U) = K$, the number of unique failure times, to calculate $(W - U)^{-1} G$ in the PCG steps we leverage the Woodbury identity,
$$(W - U)^{-1} = W^{-1} - W^{-1} \Gamma R^T \left(-D^{-1} + R \Gamma W^{-1} \Gamma R^T\right)^{-1} R \Gamma W^{-1}.$$
The matrix $-D^{-1} + R \Gamma W^{-1} \Gamma R^T$ is a $K \times K$ matrix which can be inverted easily when $K$ is small; when $K$ is larger, PCG can again be used to solve the corresponding inner system. Similar to SAIGE, we further implemented Hutchinson's randomized trace estimation 12, 13 for calculating $\mathrm{tr}(PV)$. We also implemented multithreaded parallel computation for the matrix-vector multiplications in the PCG steps using Intel Threading Building Blocks (TBB) via the RcppParallel 14 package.
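A minimal PCG solver sketch, assuming `matvec` applies $\hat{S}$ implicitly (e.g., via the raw genotypes and the Woodbury identity above) and `precond` applies a simple preconditioner such as the inverse diagonal of $\hat{S}$:

```python
import numpy as np

def pcg_solve(matvec, a, precond=None, tol=1e-6, max_iter=500):
    """Pre-conditioned conjugate gradient for S b = a (a sketch).

    Solves the system without ever forming S or S^{-1}, avoiding the
    O(N^3) inversion and the O(N^2) storage.
    """
    precond = precond or (lambda r: r)     # identity preconditioner default
    b = np.zeros_like(a)
    r = a - matvec(b)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Sp = matvec(p)
        alpha = rz / (p @ Sp)
        b += alpha * p
        r -= alpha * Sp
        if np.linalg.norm(r) < tol * np.linalg.norm(a):
            break
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p          # update search direction
        rz = rz_new
    return b
```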

3 Variance ratio approximation
Computation of the variance of the score statistic, $\mathrm{Var}_{H_0}(T) = \tilde{G}^T \hat{Q} \tilde{G}$, requires calculating $\hat{Q} \tilde{G}$ repeatedly for all markers, which is computationally expensive. To avoid calculating $\hat{Q} \tilde{G}$ for all the markers, we estimate the variance ratio $\hat{r} = \tilde{G}^T \hat{Q} \tilde{G} / \tilde{G}^T \hat{W} \tilde{G}$ using a small set of markers, and then approximate the variance of the score statistic for all markers by $\hat{r}\, \tilde{G}^T \hat{W} \tilde{G}$ in step 2. This saves substantial computation time since $\hat{W}$ is a diagonal matrix. Such variance ratio approximation approaches were previously used in various linear 10, 15, 16 and logistic 9 mixed effects models to speed up computation. Here, we provide a theoretical justification for why such an approximation works; a schematic sketch of the adaptive estimation procedure follows.
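A sketch of the adaptive estimation of $\hat{r}$; the stopping rule mirrors the procedure described later in this section (start with 30 markers, add ten at a time until the CV falls below 0.001). Here `var_exact` and `var_approx` are hypothetical callables computing $\tilde{G}^T \hat{Q} \tilde{G}$ (expensive, via PCG) and $\tilde{G}^T \hat{W} \tilde{G}$ (cheap, diagonal $\hat{W}$):

```python
import numpy as np

def estimate_variance_ratio(genotypes, var_exact, var_approx,
                            n_init=30, step=10, cv_target=1e-3, rng=None):
    """Adaptive variance ratio estimation (a sketch).

    genotypes: sequence of marker genotype vectors. Markers are added in
    batches until the coefficient of variation of the per-marker ratios
    drops below cv_target. Recomputation is kept for clarity; a real
    implementation would cache the per-marker ratios.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(genotypes))          # random marker order
    n = n_init
    while True:
        ratios = np.array([var_exact(genotypes[i]) / var_approx(genotypes[i])
                           for i in order[:n]])
        cv = ratios.std(ddof=1) / ratios.mean()
        if cv < cv_target or n >= len(genotypes):
            return ratios.mean()
        n += step                                    # add more markers
```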
To see this, let $P_{\tilde{X}} = \tilde{X}(\tilde{X}^T \hat{W} \tilde{X})^{-1} \tilde{X}^T \hat{W}$ be the weighted projection matrix for the intercept-augmented covariate matrix $\tilde{X}$. Let $E(G_i) = \mu_g$, and let the covariance matrix of $G$ be $\sigma^2_g \Psi$, where $\Psi$ is the correlation matrix of $G$. When the elements of $G$ follow the $\mathrm{Bin}(2, p_g)$ distribution, $\mu_g = 2 p_g$ and $\sigma^2_g = 2 p_g (1 - p_g)$. The matrix $\Psi$ represents the kinship matrix; however, an exact characterization of $\Psi$ is not needed for this proof. Then, $E(\tilde{G}) = \mu_g (I - P_{\tilde{X}}) 1 = 0$ and $\mathrm{Cov}(\tilde{G}) = \sigma^2_g (I - P_{\tilde{X}}) \Psi (I - P_{\tilde{X}})^T$, where $1$ is the $N \times 1$ vector with all elements equal to unity. We scale both the numerator and the denominator of the variance ratio by $N^{-1}$ so that they do not diverge to infinity when considered individually. Then both $N^{-1} \tilde{G}^T \hat{Q} \tilde{G}$ and $N^{-1} \tilde{G}^T \hat{W} \tilde{G}$ are, in expectation, proportional to $\sigma^2_g$, so the marker-specific scale cancels in the ratio. Hence the ratio is constant across all markers, as the individual limits in the numerator and denominator exist and are bounded away from zero.

Our software first estimates the variance ratio using 30 randomly selected markers, and then adds more markers in increments of ten until the coefficient of variation (CV) is smaller than 0.001. We performed a sensitivity analysis on the number of markers used to calculate the variance ratio using the UK Biobank data on white British subjects. For the analysis of the four exemplary phenotypes described in this paper, we used different numbers of markers ($N = 5, 15, 30, 50, 100, 200$) to estimate the variance ratios. For each choice of the number of markers, we selected the markers randomly 50 times, and the variance ratio estimates are presented in Supplementary Figure 15. The results show that the variance ratio estimates remain very stable overall when $\ge 30$ markers are used, and the variation decreases with an increasing number of markers.

4 Using saddlepoint approximation 17 (SPA) for the null distribution of the score statistic

In traditional score tests, the distribution of the score statistic under $H_0$ is approximated by a normal distribution, which uses only the first two moments, the mean and the variance. This approach can perform poorly in the tail regions, especially if the underlying distribution is highly skewed, as when the event under study is very rare or the tested genetic variant has a very low minor allele count (MAC). Here, similar to what has previously been applied in logistic mixed models, 9, 18-20 we use the SPA to approximate the null distribution of the score statistic to obtain accurate p-values. Based on the likelihood (1), we derive the SPA of $T_{adj} = \tilde{G}^T (\delta - \hat\mu) / \sqrt{\hat{r}\, \tilde{G}^T \hat{W} \tilde{G}}$ as a weighted sum of independent Poisson random variables given $b$. The approximated cumulant generating function (CGF) of $T_{adj}$ and its derivatives are given by
$$K(\xi) = \sum_{i=1}^{N} \hat\mu_i \left(e^{c \tilde{G}_i \xi} - 1\right) - \xi\, c \sum_{i=1}^{N} \tilde{G}_i\, \hat\mu_i,$$
$$K'(\xi) = \sum_{i=1}^{N} c\, \tilde{G}_i\, \hat\mu_i \left(e^{c \tilde{G}_i \xi} - 1\right), \qquad K''(\xi) = \sum_{i=1}^{N} c^2\, \tilde{G}_i^2\, \hat\mu_i\, e^{c \tilde{G}_i \xi},$$
where $c = (\tilde{G}^T \hat{W} \tilde{G})^{-1/2}$. To calculate the probability that $T_{adj} < s$, where $s$ is the observed test statistic, we use the formula
$$\Pr(T_{adj} < s) \approx \Phi\left\{w + \frac{1}{w} \log\frac{v}{w}\right\}, \qquad w = \mathrm{sgn}(\hat\xi)\sqrt{2\left\{\hat\xi s - K(\hat\xi)\right\}}, \qquad v = \hat\xi\sqrt{K''(\hat\xi)},$$
where $\hat\xi$ is the solution of $K'(\xi) = s$, and $\Phi$ is the standard normal distribution function.
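A sketch of the SPA p-value computation under the CGF reconstructed above; the fixed root bracket and the near-zero fallback to the normal approximation are implementation choices of this sketch, not necessarily GATE's:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def spa_pvalue(s, G_adj, mu, c):
    """Saddlepoint approximation to Pr(T_adj < s), a sketch.

    G_adj: adjusted genotype vector G~; mu: fitted means under the null;
    c: scaling constant. Uses the CGF of a weighted sum of independent
    Poisson variables with weights c * G~_i and rates mu_i.
    """
    w_i = c * np.asarray(G_adj)
    mu = np.asarray(mu)
    K = lambda xi: np.sum(mu * (np.exp(w_i * xi) - 1.0)) - xi * np.sum(w_i * mu)
    K1 = lambda xi: np.sum(w_i * mu * (np.exp(w_i * xi) - 1.0))
    K2 = lambda xi: np.sum(w_i**2 * mu * np.exp(w_i * xi))
    # K' is increasing, so the root is unique; the bracket is assumed wide
    # enough here, while a real implementation would expand it adaptively.
    xi_hat = brentq(lambda xi: K1(xi) - s, -100.0, 100.0)
    if abs(xi_hat) < 1e-8:                 # near the mean: normal approximation
        return norm.cdf(s / np.sqrt(K2(0.0)))
    w = np.sign(xi_hat) * np.sqrt(2.0 * (xi_hat * s - K(xi_hat)))
    v = xi_hat * np.sqrt(K2(xi_hat))
    return norm.cdf(w + np.log(v / w) / w)  # Barndorff-Nielsen formula
```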

5 Effect size estimation
Since our method only fits the model under the null hypothesis of no association, it cannot provide effect size estimates as part of the model fitting process. Instead, to rapidly estimate the effect sizes, we follow an approach similar to those used in EMMAX, 21 GRAMMAR-Gamma, 15 and SAIGE, 9 using the parameter estimates from the null model. Our genetic effect size estimate is given by $\hat\gamma = (\tilde{G}^T \hat{Q} \tilde{G})^{-1} \tilde{G}^T \hat{Q} (\delta - \hat\mu)$. Notice that this estimate can also be expressed as $\hat\gamma = T / \mathrm{Var}_{H_0}(T)$, where $T$ is the score test statistic, and is derived by equating the standardized Wald test statistic to the standardized score test statistic. In section 3, we showed that $\mathrm{Var}_{H_0}(T) = \tilde{G}^T \hat{Q} \tilde{G} \approx \hat{r}\, \tilde{G}^T \hat{W} \tilde{G}$. Therefore, $\hat\gamma = T / (\hat{r}\, \tilde{G}^T \hat{W} \tilde{G})$, which can be computed using the already estimated quantities $T$, $\hat{r}$, and $\tilde{G}^T \hat{W} \tilde{G}$. The standard errors can be estimated by inverting the p-values: $SE(\hat\gamma) = |\hat\gamma / z|$, where $z$ is the Z-score corresponding to the two-sided association p-value.
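A minimal sketch of the effect size and standard error computation from the already available quantities:

```python
from scipy.stats import norm

def effect_size_and_se(T, r_hat, gwg, pvalue):
    """Effect size and SE from null-model quantities (a sketch).

    T: score statistic; r_hat: variance ratio; gwg: G~' W G~;
    pvalue: two-sided association p-value (e.g., from the SPA).
    """
    gamma_hat = T / (r_hat * gwg)
    z = norm.isf(pvalue / 2.0)          # Z-score for the two-sided p-value
    return gamma_hat, abs(gamma_hat / z)
```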

6 Equivalence of penalized full and partial likelihood-based estimates
For unrelated samples, the equivalence between maximizing the partial likelihood and maximizing the full likelihood with Breslow's estimator plugged in as the estimate of the CBH function was shown by Breslow 4 for the proportional hazards model. Here, we show that, given the variance component $\tau$, maximizing $\ell_p$ (see (2)) while assuming the baseline hazard to be constant between any two consecutive failure times, i.e., estimating the CBH function using Breslow's estimator, is equivalent to maximizing the penalized partial log-likelihood described by Ripatti et al., 6
$$\ell_{pp}(\beta, \gamma, b) = \sum_{i=1}^{N} \delta_i \left\{\eta_i - \log \sum_{j \in R(t_i)} e^{\eta_j}\right\} - \frac{1}{2\tau}\, b^T V^{-1} b.$$
Plugging $\hat\Lambda_0(t)$ into the score equations corresponding to $\beta$ and $\gamma$ in (3), and rearranging the order of the summations, we find that the resulting expressions of the score functions are the same as the score functions $\partial \ell_{pp} / \partial \beta$ and $\partial \ell_{pp} / \partial \gamma$ from the penalized partial likelihood. Equivalence of the information matrices can be shown similarly.

7 Sensitivity analysis on the number of markers used to construct the GRM
We performed a sensitivity analysis on the number of markers used to construct the GRM in step 1 of GATE. We compared association results between using the 93,511 high-quality genotyped markers that were used by the UK Biobank research group to calculate kinship 22 and 245,745 pruned (500 kb window, sliding step size of 50 markers, $r^2 < 0.2$) genotyped markers with MAF $\ge 0.01$. We compared the association results between these two marker sets for the analysis of four example phenotypes in the UK Biobank data on 46 million imputed variants. The Manhattan plots (Supplementary Figure 14) were similar between the two marker sets, and the scatter plots of the p-values (Supplementary Figure 13) show highly correlated association p-values. P-values using 245,745 markers were generally slightly smaller than the p-values using 93,511 markers, especially for ischemic heart disease. A similar observation was previously made for the analysis of the binary disease status for this phenotype. 9

8 Benchmarking GATE and COXMEG
For each benchmarking metric, the median value over five replications is reported.

Computational resource requirements for steps 1 and 2
The computation time and memory requirements for the null model fitting (step 1) and the association tests (step 2) are summarized below. In addition to the sample size, the computation time for GATE in step 1 depends on other factors, such as the number of iterations required for the pre-conditioned conjugate gradient (PCG) method to converge and for the estimation of the variance component, especially when the sample size is small. This explains the non-monotonic behavior of the median computation time for GATE step 1 as the sample size increases in the low sample-size regime; the mean computation times, however, still increase monotonically with the sample size.